Credit Card Users Churn Prediction

Description

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data to identify the customers who will leave the credit card service and the reasons why, so that the bank can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective

Data Dictionary:

Load All libraries needed for this project

There are 10,127 rows and 21 columns in the data.

We will make a copy and safeguard the data before we do any further analysis.
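A minimal sketch of the load-and-copy step. The file name is a placeholder, and the tiny stand-in frame exists only so the sketch runs end to end; the notebook works with the full 10,127-row dataset.

```python
import pandas as pd

# Placeholder path -- substitute the real dataset file:
# data = pd.read_csv("credit_card_churn.csv")

# Tiny stand-in frame so the sketch is runnable.
data = pd.DataFrame({
    "CLIENTNUM": [708082083, 708083283, 708084558],
    "Customer_Age": [45, 38, 62],
    "Education_Level": ["Graduate", None, "High School"],
})

# Keep a deep copy so the raw data stays untouched while we transform df.
df = data.copy(deep=True)
print(df.shape)
```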

Initial Data Inspection

There are 6 columns with strings in them; the rest are all numerical columns.

Two columns, Education_Level and Marital_Status, have a substantial number of null values in them. The rest of the columns report non-null data. However, we should make sure those columns are not hiding placeholder values that are effectively NaN.

There are no duplicates. The CLIENTNUM column is a unique identifier of the customer and adds no value to the analysis. We can drop that column.
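A sketch of the duplicate check and the identifier drop. The frame here is a stand-in; in the notebook `df` is the copied dataset.

```python
import pandas as pd

# Stand-in frame for the copied dataset.
df = pd.DataFrame({
    "CLIENTNUM": [708082083, 708083283, 708084558],
    "Customer_Age": [45, 38, 62],
})

n_dupes = df.duplicated().sum()        # count of full-row duplicates
id_unique = df["CLIENTNUM"].is_unique  # identifier uniqueness check

# CLIENTNUM identifies the customer but carries no signal, so drop it.
df = df.drop(columns=["CLIENTNUM"])
print(n_dupes, id_unique, list(df.columns))
```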

15% of records have missing data for Education_Level, and 7.4% have missing data for Marital_Status.

We will need to look closely at them and impute data as needed.
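The missing-value share per column can be computed as below. The stand-in frame gives different percentages; on the real data this would show roughly 15% and 7.4%.

```python
import pandas as pd

# Stand-in frame with deliberately missing values.
df = pd.DataFrame({
    "Education_Level": ["Graduate", None, "High School", None],
    "Marital_Status": ["Married", "Single", None, "Married"],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = df.isna().mean() * 100
print(missing_pct)
```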

Data Pre-processing

There is nothing unusual about the data in the rest of the columns. So this must be a data-entry error, or perhaps these customers did not want to reveal their income.

Since there are 1,112 such records, making up almost 11% of the data, we cannot drop them. Let's replace them with NaN and impute them later.
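A sketch of the replacement step. Both the column name `Income_Category` and the placeholder string `"abc"` are assumptions here; substitute whatever the actual column and bad value are.

```python
import numpy as np
import pandas as pd

# Stand-in frame; "abc" stands for the assumed bad placeholder value.
df = pd.DataFrame({
    "Income_Category": ["$60K - $80K", "abc", "Less than $40K", "abc"],
})

# Replace the placeholder with NaN so it can be imputed later.
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)
print(df["Income_Category"].isna().sum())
```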

Exploratory Data Analysis

Note

I am borrowing/copying some utility functions that were used in the MLS sessions, for both univariate and bivariate analysis, instead of trying to create them on my own.

Univariate Analysis

First, I will draw histograms and boxplots for the numerical variables that are continuous in nature.

Then I will do barplots for some of the numerical data as well as categorical data.

Mostly a normal curve with a couple of outliers. Customers can be aged 70 or above, so this doesn't need any outlier treatment.

This is another normal distribution with a few outliers at both ends. The outliers look reasonable for the number of months of relationship with bank, so no outlier treatment required here.

The distribution has a sharp right skew as we had previously expected. There are a number of people with very high credit limits, which may correspond to high income individuals.

This is a normal distribution in the middle, with sharp peaks at the two ends. The 0 balance records need to be looked at to see if there is anything unusual.

The zero revolving balances mean that a significant number of people don't carry a credit card balance from month to month and instead pay it off in full every month.

I see that these individuals have all kinds of credit limits. This confirms that many people simply do not carry a balance from month to month and pay off their credit cards in full.

This doesn't look suspicious and is likely real data: 2,740 people, or 27% of the customer base, pay off their credit card balances in full every month.

Another heavily right-skewed distribution. Interestingly, the pattern is almost identical to that of the credit limit.

Initially I thought people were maxing out their credit cards, which would explain the right skew in "open to buy". But since this pattern is so similar to the credit limit's, the opposite is true: people are not utilizing much of their credit limits and are perhaps not using their credit cards heavily.

The utilization rate will give me a better picture.

The utilization rate confirms that a majority of people are not really utilizing much of their credit limits, hence they are not using the credit card very much.

The total transaction count is normally distributed, but the total transaction amount has four peaks. This suggests the data contains overlapping categories of customers, perhaps in different earning ranges, who use their cards in different ways.

Ideally, these different categories of customers should be separated out and analyzed in their own groups. But I will not have the time to do that.

The Q4-to-Q1 amount-change and count-change graphs are normal distributions, but both have significant outliers. We will need to clip the data as part of feature engineering to handle them.

Now for some analysis of the categorical data as well as numerical data that are discrete in nature.

As previously noted, there is roughly a 5:1 ratio of people who keep the credit card to those who drop it.

Most people have 2-3 dependents (spouse and/or children).

The biggest group of customers are graduates, with the second biggest educated up to high school. There is a significant population of uneducated customers as well.

Married customers make up 50% of the customers.

The majority of customers have income less than $40K.

Most people have the Blue card, which is probably the entry-level card. The other cards, which may have more benefits and perhaps an annual fee, are far less common. Blue cards make up 93% of all cards issued.

Most customers tend to keep 3 or more types of accounts/products with the bank. Only 9% of customers have a single account/product with the bank.

It is not unusual for customers to go a month without interacting with the bank. A majority of customers have 2-3 months of inactivity.

Most customers have between 2-3 interactions with the bank in a 12 month period.

Bivariate Analysis

Observations

The heatmap and the pairplot both show that there isn't a clear pattern or correlation between those who continue to keep the credit card and those who have attrition.

Hence, we will need to look at each individual combination separately.

Once again borrowing a function from the MLS class instead of creating my own.

People who have had their accounts for a shorter period of time are more likely to churn.

People with Doctorate degrees have a slightly greater chance of attrition. Other than that, there doesn't seem to be a huge variation by level of education.

Women are slightly more likely than men to renounce their credit cards.

Dependent count doesn't seem to affect attrition.

Marital status doesn't seem to have an impact on churn.

Interestingly, people at both the lowest and highest ends of the income range appear to be slightly more likely to renounce their credit cards.

We have a clearer pattern here: people with Platinum and Gold cards are more likely to give up their credit cards than those with Blue or Silver cards.

Another pattern emerges here: people with fewer products with the bank are more likely to churn, especially those with just one or two products.

There is an interesting pattern here.

This is very interesting: as the number of contacts between the bank and the customer increases, the likelihood of churn rises quickly. In fact, at six contacts there is 100% churn.

Function borrowed from MLS class.

There is a big overlap between the people who stay and those who renounce across the credit limit range, so no clear pattern emerges.

Attrition appears to be lower for people who have larger revolving balances. Perhaps it is more difficult for them to get out of debt and they keep their credit cards.

People with fewer transactions have a higher likelihood of churn.

Ignoring the outliers, this shows that people whose transactions dropped by 40% or more in Q1 compared to Q4 are more likely to renounce their credit cards.

On the flip side, people whose transactions increased in Q1 compared to Q4 are likely to keep their credit cards.

Summary of EDA

There are certain risks that are starting to show up.

Feature Engineering

Drop all the columns that we identified as redundant

We had previously identified that the Education_Level and Marital_Status columns will need to be imputed. In order to use the KNN imputer, we will need to convert the categorical data into numbers as required by the imputer. Later on, we will restore the original values through a reverse mapping.
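The encode-impute-restore round trip described above can be sketched as below. The category ordering, column subset, and `n_neighbors` value here are assumptions for illustration; the notebook applies this to the full feature set.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Stand-in frame with one missing categorical value.
df = pd.DataFrame({
    "Customer_Age": [45, 38, 62, 51],
    "Education_Level": ["Graduate", None, "High School", "Graduate"],
})

# Forward map: category -> integer code; reverse map restores labels later.
levels = ["High School", "Graduate"]  # assumed ordering for the sketch
fwd = {lvl: i for i, lvl in enumerate(levels)}
rev = {i: lvl for lvl, i in fwd.items()}

df["Education_Level"] = df["Education_Level"].map(fwd)
imputed = KNNImputer(n_neighbors=2).fit_transform(df)
df = pd.DataFrame(imputed, columns=df.columns)

# KNN returns floats; round back to the nearest code, then restore labels.
df["Education_Level"] = df["Education_Level"].round().astype(int).map(rev)
print(df["Education_Level"].tolist())
```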

Split up the data for modeling. Also ensure that imputation happens separately within each split of the data.
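A sketch of a stratified train/validation/test split. The 60/20/20 proportions and the synthetic arrays are assumptions; stratifying on the target keeps the roughly 5:1 class ratio in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix X and target y.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Carve out the test set first, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```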

Building the model

Model evaluation criterion:

The bank wants to know which customers are likely to give up their credit cards, so that it can follow up with them or generally engage with them so that they remain credit card customers.

As a result, it is very important to minimize false negatives.

The best way to minimize false negatives is to maximize the recall score.

Approach

I will create 6 different models and perform K-fold cross-validation. Then I will verify the results on the validation set. Finally, if the results look acceptable and generalizable, I will evaluate against the test data to measure performance on unseen data.
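The cross-validation loop scored on recall can be sketched as below. The notebook's six candidates include XGBoost, which is a separate library; this sketch uses three scikit-learn estimators on synthetic imbalanced data just to show the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(n_samples=400, weights=[0.84], random_state=7)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=7),
    "GradientBoost": GradientBoostingClassifier(random_state=7),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
}

# 5-fold CV scored on recall, since false negatives are the costly error.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="recall").mean()
          for name, m in models.items()}
print(scores)
```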

All the models are working better on the validation data than on the training data.

My 3 best performers are XGBoost, GradientBoost, and AdaBoost.

Copying utility functions from MLS class to render the confusion matrix, etc.

Improve the AdaBoost model

This AdaBoost model is actually overfitting the training data and capturing noise. The recall score on the validation set has degraded from that on the training set.

Improve the XGBoost Model

This has given me consistent recall of 0.93 on both the training set and the validation set. This result is expected to generalize to the test data set.

Improving the GradientBoost model

This model degraded significantly from the training set to the validation set, so it is overfitting the training data.

Let's oversample the data to correct the class imbalance toward the value of 0 for Attrition_Flag.

Oversampling using SMOTE
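In the notebook this step would typically use imblearn (`from imblearn.over_sampling import SMOTE; X_res, y_res = SMOTE(random_state=1).fit_resample(X_train, y_train)`). Below is a minimal illustration of the idea behind SMOTE, not the library itself: synthesize minority samples by interpolating between a minority point and one of its nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])  # minority points

nn = NearestNeighbors(n_neighbors=2).fit(X_min)
_, idx = nn.kneighbors(X_min)  # idx[:, 0] is each point itself

# For each minority point, pick its nearest neighbor and interpolate
# at a random position along the segment between the two.
synthetic = []
for i, row in enumerate(X_min):
    neighbor = X_min[idx[i, 1]]
    gap = rng.random()
    synthetic.append(row + gap * (neighbor - row))
synthetic = np.array(synthetic)
print(synthetic.shape)
```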

Evaluate tuned AdaBoost Model on the oversampled data

Oversampling has not helped with AdaBoost model.

Evaluate tuned XGBoost Model on the oversampled data

Evaluate tuned GradientBoost Model on the oversampled data

Once again, the recall score has degraded on the validation set, so this model is overfitting the data.

Now to evaluate the tuned model on undersampled data.

In this case, we will remove entries from the majority class of the target variable.
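A sketch of random undersampling of the majority class, assuming a pandas feature frame and target series (imblearn's `RandomUnderSampler` would do the same job): keep all minority rows and a random sample of majority rows of equal size.

```python
import pandas as pd

# Stand-in target with a heavy majority of 0s.
y = pd.Series([0] * 90 + [1] * 10)
X = pd.DataFrame({"feature": range(100)})

# Keep only as many randomly chosen majority rows as there are minority rows.
minority_idx = y[y == 1].index
majority_idx = y[y == 0].sample(n=len(minority_idx), random_state=1).index
keep = minority_idx.union(majority_idx)

X_under, y_under = X.loc[keep], y.loc[keep]
print(y_under.value_counts().to_dict())
```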

Evaluate Tuned AdaBoost model with undersampled data

The model is still overfitting, but the performance on the validation set is better.

Evaluate Tuned XGBoost model with undersampled data

Evaluate Tuned GradientBoost model with undersampled data

The recall on the validation set is 0.896, although the model is clearly overfitting the training data.

Finally let's test the models on the test set

I am getting the best recall performance on the test data from the GradientBoost model trained on the undersampled data, which gives 0.926 recall and 0.908 accuracy.

Relative Feature Importance
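Relative feature importance can be extracted from the fitted model as sketched below. The synthetic data and feature names are stand-ins; the notebook reads `feature_importances_` from the tuned model on the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=3)
cols = [f"feature_{i}" for i in range(6)]

model = GradientBoostingClassifier(random_state=3).fit(X, y)

# Sort importances so the most influential features come first.
importance = (pd.Series(model.feature_importances_, index=cols)
              .sort_values(ascending=False))
print(importance.head())
```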

Business Recommendation

Based on the relative importance of the various features, here are the top 5 recommended actions: